Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections

نویسندگان

  • Helena Ahonen
  • Oskari Heinonen
  • Mika Klemettinen
  • A. Inkeri Verkamo
چکیده

Traditionally, texts have been analysed using various information retrieval related methods, such as full-text analysis, and natural language processing. However, only few examples of data mining in text, particularly in full text, are available. In this paper we show that general data mining methods are applicable to text analysis tasks such as descriptive phrase extraction. Moreover, we present a general framework for text mining. The framework follows the general knowledge discovery process, thus containing steps from preprocessing to the utilization of the results. The data mining method that we apply is based on generalized episodes and episode rules. We give concrete examples of how to preprocess texts based on the intended use of the discovered results and we introduce a weighting scheme that helps in pruning out redundant or non-descriptive phrases. We also present results from real-life data experiments.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Background Readings for Collection Synthesis

Automatic collection building will be key to efficient and affordable digital libraries of the future. Since hand-curated collections will not scale with the Web, effective automated techniques are highly desired. This section deals with crawling the Web to extract topics, related documents, or collections. To do that, technologies such as feature extraction, page categorization, andWeb crawlin...

متن کامل

Text Mining

“Bag of words” model, acronym extraction, authorship ascription, coordinate matching, data mining, document clustering, document frequency, document retrieval, document similarity metrics, entity extraction, hidden Markov models, hubs and authorities, information extraction, information retrieval, key-phrase assignment, key-phrase extraction, knowledge engineering, language identification, link...

متن کامل

Feature extraction in opinion mining through Persian reviews

Opinion mining deals with an analysis of user reviews for extracting their opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in such area. In general, opinion mining extracts user reviews at three levels of document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...

متن کامل

College of Library and Information Services University of Maryland College Park, Md

This paper surveys techniques for applying data mining techniques to large text collections and illustrates how those techniques can be used to support the management of science and technology research. Specific issues that arise repeatedly in the conduct of research management are described, and a textual data mining architecture that extends a classic paradigm for knowledge discovery in datab...

متن کامل

A Systematic study of Text Mining Techniques

Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections such as information extraction, term extraction, text categorization, and storage of in...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998